DS5110 Final Project Assignment

The Ed Squad

Shilpa Narayan (smn7ba)
Ashlie Ossege (ajo5fs)
Jamie Oh (hso6b)
Isaac Stevens (is3sb)

About the Data

AMERICAN COMMUNITY SURVEY 2015-2019 5-YEAR SAMPLE: a 5-in-100 national random sample of the population. It contains all households and persons from the 1% ACS samples for 2015, 2016, 2017, 2018, and 2019, identifiable by year. The data include persons in group quarters. This is a weighted sample. The smallest identifiable geographic unit is the PUMA, containing at least 100,000 persons. PUMAs do not cross state boundaries.
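Because this is a weighted sample, raw row counts and population estimates can differ. The sketch below illustrates the idea in plain Python. It assumes the extract carries the IPUMS person weight PERWT; the `has_degree` field and all values are made up for illustration.

```python
# Hypothetical records: each row stands for PERWT people in the population.
records = [
    {"PERWT": 120, "has_degree": True},
    {"PERWT": 80, "has_degree": False},
    {"PERWT": 200, "has_degree": False},
]


def weighted_share(rows, flag, weight="PERWT"):
    """Return the share of total weight carried by rows where `flag` is true."""
    total = sum(r[weight] for r in rows)
    hits = sum(r[weight] for r in rows if r[flag])
    return hits / total


# Unweighted share treats every row equally; weighted share uses PERWT.
unweighted = sum(r["has_degree"] for r in records) / len(records)
weighted = weighted_share(records, "has_degree")
print(f"unweighted={unweighted:.3f} weighted={weighted:.3f}")
```

Whether weights should also enter model training is a separate design decision that this sketch does not address.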

The lowest unit of geography in the microdata files is still the PUMA. PUMAs contain at least 100,000 people. Aggregate data (but not microdata) is currently available from the Census Bureau for geographic areas as small as block groups, but only for the entire 2005-2009 period.

PERNUM numbers all persons within each household consecutively in the order in which they appear on the original census or survey form. When combined with SAMPLE and SERIAL, PERNUM uniquely identifies each person within the IPUMS.
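A quick way to confirm that SAMPLE, SERIAL, and PERNUM together identify each person is to look for repeated key tuples. This is a minimal sketch in plain Python; the rows and the SAMPLE code are placeholder values, not taken from the actual extract.

```python
from collections import Counter

# Hypothetical rows; a real extract carries many more columns.
rows = [
    {"SAMPLE": 201903, "SERIAL": 1, "PERNUM": 1},
    {"SAMPLE": 201903, "SERIAL": 1, "PERNUM": 2},
    {"SAMPLE": 201903, "SERIAL": 2, "PERNUM": 1},
]


def person_key(row):
    """Build the composite person identifier for one row."""
    return (row["SAMPLE"], row["SERIAL"], row["PERNUM"])


def duplicate_keys(rows):
    """Return every composite key that appears more than once."""
    counts = Counter(person_key(r) for r in rows)
    return [key for key, n in counts.items() if n > 1]


dupes = duplicate_keys(rows)
print("duplicate person keys:", dupes)
```

An empty list means the composite key is unique in the rows checked.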

MULTYEAR identifies the actual year of survey in multi-year ACS/PRCS samples.

For example, the 3-year ACS and PRCS data files each include cases from three single-year files. For these multi-year samples, the YEAR variable identifies the last year of data (2007 for the 2005-2007 3-year data; 2008 for the 2006-2008 data; and so on). MULTYEAR gives the single-year sample from which the case was drawn (2005, 2006, or 2007 for the 2005-2007 3-year data; 2006, 2007, or 2008 for the 2006-2008 3-year data; and so on).

https://usa.ipums.org/usa/acs_multyr.shtml
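To illustrate the difference between YEAR and MULTYEAR described above, the sketch below counts records per actual survey year. The rows are invented; they assume a 2015-2019 5-year file in which YEAR holds the last year of the span.

```python
from collections import Counter

# Hypothetical rows from a 2015-2019 5-year file.
rows = [
    {"YEAR": 2019, "MULTYEAR": 2015},
    {"YEAR": 2019, "MULTYEAR": 2015},
    {"YEAR": 2019, "MULTYEAR": 2017},
    {"YEAR": 2019, "MULTYEAR": 2019},
]

# YEAR is constant across the file; MULTYEAR varies by record.
years_in_file = {r["YEAR"] for r in rows}
per_survey_year = Counter(r["MULTYEAR"] for r in rows)

print("YEAR values:", sorted(years_in_file))
for year in sorted(per_survey_year):
    print(year, per_survey_year[year])
```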

Read In Data

Preprocess Data

EDA

Full Data EDA

Education EDA

Gender EDA

Balance the Data for a Similar Number of Rows per EDUC FLAG Value
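One common way to balance classes is to downsample every class to the size of the smallest one. The sketch below shows that approach in plain Python; it is not necessarily the method used in this notebook, and the column name `EDUC_FLAG`, the `id` field, and the class sizes are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def downsample_to_smallest(rows, label, seed=42):
    """Randomly keep the same number of rows from every class."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[label]].append(row)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced


# Hypothetical imbalanced data: 20 positive rows and 80 negative rows.
rows = [{"EDUC_FLAG": 1, "id": i} for i in range(20)]
rows += [{"EDUC_FLAG": 0, "id": i} for i in range(80)]

balanced = downsample_to_smallest(rows, "EDUC_FLAG")
class_counts = Counter(r["EDUC_FLAG"] for r in balanced)
print(class_counts)
```

Downsampling discards majority-class rows, so it trades data volume for balance.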

EDA On Sampled Data

Model Construction

Baseline Logistic Regression Model With Only 36 Features

Evaluate Baseline Model
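The metrics used to evaluate the baseline are not listed here, so the following is only a sketch of common binary-classification metrics computed from predictions. The labels and predictions are made up.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


# Hypothetical true labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```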

Vector Assemble All Features

Split Data Into Train and Test Sets
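A seeded random split keeps the train and test sets reproducible across runs. The sketch below is a plain-Python illustration; the 80/20 ratio and the seed are assumptions, not values confirmed by this notebook.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut off the first part as the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data: 100 row identifiers.
train, test = train_test_split(list(range(100)))
print(len(train), len(test))
```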

Scale Data to Prepare for PCA
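PCA is sensitive to feature scale, so features are typically standardized first. The sketch below fits the scaling statistics on the training rows only and reuses them for the test rows. It uses the population standard deviation, which is an assumption; a library scaler may use the sample standard deviation instead. All values are invented.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Compute the per-column mean and standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize each value as (x - mean) / std."""
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical features on very different scales.
train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
test = [[4.0, 400.0]]

means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)
print(train_scaled)
print(test_scaled)
```

Fitting on the training rows alone keeps information from the test set out of the preprocessing step.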

PCA

Logistic Regression Pipeline and Evaluation

Models

Param Grids
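A parameter grid is the Cartesian product of the candidate values for each hyperparameter, and cross validation then fits one model per combination. The sketch below expands such a grid in plain Python. The parameter names and values are hypothetical and are not the ones tuned in this notebook.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of the candidate values."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Hypothetical hyperparameters and candidate values.
grid = {
    "reg_strength": [0.01, 0.1, 1.0],
    "max_iter": [50, 100],
}

combos = expand_grid(grid)
print(len(combos), "combinations")
for combo in combos:
    print(combo)
```

The number of fits grows multiplicatively with each added parameter, which matters when every fit runs once per cross-validation fold.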

Support Vector Machine Pipeline, Model, Cross Validation and Evaluation

Gradient Boosting Pipeline, Model, Cross Validation and Evaluation

Random Forest Pipeline, Model, Cross Validation and Evaluation

Interpretation of PCA
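One common starting point for interpreting PCA is the share of variance each component explains. The sketch below computes those shares from eigenvalues and finds how many components reach a chosen cumulative threshold. The eigenvalues and the 0.8 threshold are made up for illustration.

```python
from itertools import accumulate


def explained_variance_ratio(eigenvalues):
    """Return each eigenvalue as a fraction of the total variance."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def components_for(eigenvalues, threshold):
    """Return the smallest number of components reaching the threshold."""
    cumulative = accumulate(explained_variance_ratio(eigenvalues))
    for k, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues, sorted from largest to smallest.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
ratios = explained_variance_ratio(eigenvalues)
k = components_for(eigenvalues, threshold=0.8)
print("ratios:", ratios)
print("components needed:", k)
```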